Data, DataViz, and Stats with the Stars

Agenda!

  • Orange? What is this Orange stuff, anyhow?
  • Throwing it All Away with Brad Pitt: Data Summaries
  • Counting Letters with Sherlock Holmes: Bar Charts
  • Nursery Rhymes with Ben Affleck: Line Charts
  • Being a Mermaid with Katie Ledecky: Box Plots
  • Jack and Rose lived happily ever after: Mosaic Plots
  • The Art of Surprise with Gabbar Singh: Permutation Tests

Orange? What is this Orange stuff, anyhow?

Installing and Getting Used to Orange

some stuff here

Brad Pitt: Throwing it All Away

Stephen Stigler (2016) in “The Seven Pillars of Statistical Wisdom”:

  • One of the Big Ideas in Statistics is: Aggregation
  • How is it revolutionary?
  • By stipulating that, given a number of observations, you can actually gain information by throwing information away
  • In taking a simple arithmetic mean, we discard the individuality of the measures, subsuming them to one summary.
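The idea of aggregation can be sketched in a few lines of Python. This is only an illustration: the numbers below are invented, and `observations` and `summary` are names chosen for this example.

```python
from statistics import mean

# Hypothetical measurements, invented for illustration only.
observations = [4.0, 5.0, 6.0, 5.5, 4.5]

# Aggregation: five individual values collapse into one summary number.
summary = mean(observations)

print(f"{len(observations)} observations -> mean = {summary}")
```

The individual values can no longer be recovered from `summary`, which is exactly the "throwing away" described above.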

Brad Pitt: Throwing it All Away

What was he throwing away?

data table here

“OBP” as aggregate column explanation here
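As a rough sketch, and assuming “OBP” here means on-base percentage in its usual baseball definition, the aggregate could be computed as below. The function name `on_base_percentage` and the season line are hypothetical and do not come from the data table.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Times on base divided by the plate appearances that count towards OBP."""
    reached = hits + walks + hit_by_pitch
    chances = at_bats + walks + hit_by_pitch + sacrifice_flies
    return reached / chances

# Hypothetical season line, not a real player.
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5,
                         at_bats=500, sacrifice_flies=5)
print(round(obp, 3))
```

Five separate columns are reduced to one number per player, which is another instance of gaining information by discarding detail.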

Counting Letters with Sherlock Holmes

Sherlock Holmes: The Adventure of the Dancing Men

In the Sherlock Holmes story The Adventure of the Dancing Men, a criminal known to one of the characters communicates with her using child-like drawings of stick figures, which look like this:

Am Here, Abe Slaney

How would Holmes decipher this message?

Sherlock Holmes: The Adventure of the Dancing Men

  • Using Conjectures: Symbols -> Letters
    • Holmes conjectures that the most frequent symbol in the message stands for “E”, the most common letter in English
    • He then conjectures that the next most frequent symbol stands for “T”, the second most common letter

Zipf’s Law

  • Based on well-known frequency counts of letters in English text (compare Zipf’s Law for word frequencies)
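Counting letters the way Holmes does can be sketched in Python. This is a minimal illustration using the decoded message from the slide above; `letter_counts` is a helper name chosen for this example.

```python
from collections import Counter

def letter_counts(text):
    """Count alphabetic characters, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

counts = letter_counts("Am Here, Abe Slaney")

# Print the three most frequent letters with their counts.
for letter, n in counts.most_common(3):
    print(letter, n)
```

Even in this very short message the most frequent letter is “e”, which is the kind of regularity the frequency argument relies on. A longer text would give far more reliable counts.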

What Charts work for counting?

Variable #1    Variable #2    Chart Names    Chart Shape
Qual           None           Bar Chart

Bar charts are used to show “counts” and “tallies” with respect to Qual variables. For instance, in a survey, how many respondents are there of each gender? In a Target Audience survey on Weekly Consumption, how many people fall into the low, medium, or high expenditure groups?
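A tally of one Qual variable, drawn as a crude text bar chart, might look like this in Python. The survey responses are invented, and `text_bar_chart` is a hypothetical helper written for this sketch; in the session itself Orange draws the chart.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration).
responses = ["low", "high", "medium", "low", "medium", "low", "high", "low"]

# One count per level of the Qual variable.
tally = Counter(responses)

def text_bar_chart(tally, order):
    """Return one line per category, with a bar of '#' as long as its count."""
    width = max(len(level) for level in order)
    return [f"{level:<{width}} {'#' * tally[level]} {tally[level]}"
            for level in order]

lines = text_bar_chart(tally, ["low", "medium", "high"])
for line in lines:
    print(line)
```

Each bar's length is simply the count for that level, which is all a bar chart encodes.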

Where’s our Data?

OK, Let’s get some data to count:

Penguins Data

And let’s for now use a pre-set Workflow in Orange

Barchart Workflow

Workflow #1: Bar Charts

  • We will look at the data
  • Make a Data dictionary
  • Identify the Qual and Quant variables
  • Prepare Counts and Bar Charts with respect to Qual variables
  • In Orange! Point, Click, and See!

Data Dictionary

  • species: Species of the penguin (Qual)
  • island: Island where the penguin was observed (Qual)
  • bill_length_mm: Length of the penguin’s bill in millimeters (Quant)
  • bill_depth_mm: Depth of the penguin’s bill in millimeters (Quant)
  • flipper_length_mm: Length of the penguin’s flipper in millimeters (Quant)
  • body_mass_g: Mass of the penguin in grams (Quant)
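Separating Qual from Quant columns can be sketched in plain Python. The two rows below are illustrative values shaped like the data dictionary, and `variable_kind` is a hypothetical helper using a deliberately simple rule: all-numeric columns are Quant, everything else is Qual.

```python
# Illustrative rows shaped like the data dictionary; not the full dataset.
rows = [
    {"species": "Adelie", "island": "Torgersen", "bill_length_mm": 39.1,
     "bill_depth_mm": 18.7, "flipper_length_mm": 181, "body_mass_g": 3750},
    {"species": "Gentoo", "island": "Biscoe", "bill_length_mm": 46.1,
     "bill_depth_mm": 13.2, "flipper_length_mm": 211, "body_mass_g": 4500},
]

def variable_kind(values):
    """Label a column Quant if every value is numeric, otherwise Qual."""
    is_numeric = all(isinstance(v, (int, float)) for v in values)
    return "Quant" if is_numeric else "Qual"

kinds = {col: variable_kind([row[col] for row in rows]) for col in rows[0]}

for col, kind in kinds.items():
    print(f"{col}: {kind}")
```

This rule is too naive for real data (a numeric ID or a coded category would be mislabelled as Quant), so the data dictionary should always be checked by eye.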

Counting our Data

Research Question

How many penguins of different species are there in the dataset?

Wait, But Why?

  • Always count your chickens, sorry, your data, before you model or infer!
  • Counts first give you an absolute sense of how much data you have.
  • Counts by different Qual variables give you a sense of the combinations present in your data: (Male/Female) * (Species) * (Island), say 2 * 3 * 3 = 18 possible combinations
  • Counts then tell you whether your data is lop-sided: do you have too many observations of one category (level) and too few of another in a given Qual variable?
  • Balance matters both for drawing sound inferences and for training ML algorithms properly.
  • Since the X-axis in bar charts is Qualitative (the bars don’t touch, remember!) it is possible to sort the bars at will.
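The points above can be sketched together in Python. The observations are invented for illustration and are not the real penguins data, and the max/min ratio is just one crude way to flag lop-sided counts.

```python
from collections import Counter

# Hypothetical (sex, species, island) observations, invented for illustration.
penguins = [
    ("Male", "Adelie", "Torgersen"),
    ("Female", "Adelie", "Torgersen"),
    ("Male", "Adelie", "Biscoe"),
    ("Female", "Gentoo", "Biscoe"),
    ("Male", "Gentoo", "Biscoe"),
    ("Female", "Chinstrap", "Dream"),
    ("Male", "Adelie", "Dream"),
]

# Counts for a single Qual variable, and for each full combination.
species_counts = Counter(species for _, species, _ in penguins)
combo_counts = Counter(penguins)

# Upper bound on combinations: product of the number of levels per variable.
n_levels = [len({row[i] for row in penguins}) for i in range(3)]
possible = n_levels[0] * n_levels[1] * n_levels[2]

# Bars on a Qual axis can be reordered freely, e.g. tallest first.
sorted_bars = species_counts.most_common()

# Crude lop-sidedness check: largest count divided by smallest count.
imbalance = max(species_counts.values()) / min(species_counts.values())

print(f"{len(combo_counts)} of {possible} possible combinations observed")
print(sorted_bars, imbalance)
```

Comparing the observed combinations with the possible ones shows at a glance which groups are missing from the data.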

Nursery Rhymes with Ben Affleck

Who was Solomon Grundy?

Being a Mermaid with Katie Ledecky

Jack and Rose lived happily ever after

The Art of Surprise with Gabbar Singh

Why is this slide always showing up?